[...] requires the examinee to produce a response. Since these responses
must be evaluated, the factor of rater judgment influences the
reliability of scores. The problem of scoring reliability is one
which pervades the literature on creativity research, where either
low estimates or no estimates have been reported when tests from the
[...] analysis of variance procedures. General principles for training
raters and for analyzing the results of the design will be discussed.
(Author)






EFFECTS OF TRAINING ON RATING RELIABILITY, AS ESTIMATED BY
ANOVA PROCEDURES, FOR FLUENCY TESTS OF CREATIVITY

Cynthia L. Williams
University of Pittsburgh

Presented at the Annual Meeting of the
National Council on Measurement in Education
New Orleans, Louisiana
February, 1973

Measures of creative mental abilities, such as the divergent
production battery developed by Guilford and his colleagues (e.g.,
Guilford, Wilson, & Christensen, 1952; Guilford, Kettner, & Christensen,
1954), require the examinees to produce a response, given some basic
information. Since these responses must then be evaluated, the factor
of rater judgment influences the reliability of response scores.
Research on the problem of rater judgment has indicated that raters, in
general, tend to differ from one another in the scoring criteria applied,
to change the scoring criteria for different individuals being rated,
and to differ with respect to the distribution of grades throughout the
score scale (Coffman, 1971).

The problem of scoring reliability is one which pervades the
literature on creativity research, where tests from the divergent
production battery are often employed. Many studies which utilize tests
from the battery do not report estimates of scoring reliability
(Fulgosi & Guilford, 1968; Cropley, 1967; Cline, Richards, & Needham,
1963; Christensen, Guilford, & Wilson, 1957). When scoring reliability
estimates are given, they are typically low. For example, Shin (1971)
reports scoring reliabilities of 0.81, 0.79, 0.78, 0.74, and 0.68 for
two raters of the five divergent production tests used in his study.
Curiously, the manuals accompanying the divergent production tests
do not contain information regarding scoring reliability. Rather,
alternate form reliability coefficients are reported. Finally, reports
of the factor analytic studies on the structure of intellect model from
which the divergent production battery was derived do not include
scoring reliability estimates, although various internal consistency
coefficients are reported (Gershon, Guilford, & Merrifield, 1963;
Guilford, Christensen, Frick, & Merrifield, 1957; Guilford, Merrifield,
& Cox, 1961; Hoepfner & Guilford, 1965).

As was indicated previously, the lack of consistent scoring
criteria across raters produces scoring unreliability. A firmly held
belief is that rating errors can be minimized and scoring reliability
increased by the careful training of raters (Guilford, 1964). Thus,
the major purpose of the research undertaken was twofold. First,
procedures for training raters to score protocols from the Utility Test,
a test from the divergent production battery, were developed. The
Utility Test was selected from the battery of tests available because of
its wide use in the creativity research literature. Also, this test
provides a measure of ideational fluency, a factor which has been
suggested as a pervasive element in the measurement of creativity
(Fulgosi & Guilford, 1968; Christensen, Guilford, & Wilson, 1957;
Clark & Mirels, 1970; Shin, 1971). Secondly, the procedures were
evaluated for their effectiveness in increasing scoring reliability.
In addition to the effects of training, the factor of scoring order was
investigated. Since the Utility Test contains two parts, one could
question whether the scores assigned by raters are a function of the
order in which the raters scored each part, that is, scoring Part I
first and Part II second as contrasted with scoring Part II first and
Part I second. One could also question whether the factor of sequence
of scoring systematically influences the scores assigned. The presence
of a sequence effect would indicate that the average score assigned to
those protocols scored first differs from the average score assigned to
those scored second, regardless of the test part. One final factor
investigated was whether the average scores of the two parts of the
Utility Test were equal. The investigation of the order, sequence, and
test part variables provides information tangential to the major
purpose of the study, but allows one to examine potential sources of
variation in the general rating situation.

METHOD

A. Development of the Training Procedure

In developing the training procedure, reference was made to
other types of measuring devices which use ratings, such as essay
examinations and projective techniques, as well as other measures of
creativity. Development of consistent scoring standards across all
raters appeared to be the major concern of researchers using such
devices, and several general principles for training raters were
identified. The first step in a training program should be one of
developing the concept of interest and of establishing a rationale for
the measuring procedure. Secondly, the scoring procedure should be made
as objective as possible, leaving little room for questions from the
raters (Grant & Caplan, 1957). Non-overlapping response categories
should be developed and defined precisely. In addition, examples of
typical responses occurring in each category should be included. Rater
practice in the use of the scoring procedure is a crucial aspect of the
training (Tomkins, 1947; Anderson, 1960; Feldt, 1962; Eisner, 1965).
In conjunction with these practice sessions, discussions regarding
rating discrepancies should be held (Tomkins, 1947; Eisner, 1965;
Feldt, 1952). Finally, Guilford (1964) has suggested that the raters be
made aware of the various rating errors, such as leniency errors,
relative halo effects, and contrast errors.

Imparting to the rater knowledge about the construct is a
primary objective in rater training. This process typically includes a
definition of the construct and/or a rationale for the testing
procedure. In the manual for the Torrance Tests of Creative Thinking
(Torrance, 1965, p. 19), "the importance of familiarity with the
rationale of the test tasks and the concepts of fluency, flexibility,
originality, and elaboration" is emphasized. Also included in the
scoring guide for this test (pp. 6-16) is a discussion of the rationale
for both the figural and verbal tasks. The introduction to the developed
training materials contains a brief discussion of divergent production
and the structure of intellect model, as proposed by Guilford. The basic
factors of fluency, flexibility, originality, elaboration, redefinition,
and sensitivity are briefly defined. Since the function of the program
is to train raters to score protocols for fluency, a description of the
fluency factor and of some of the proposed measures from the divergent
production battery is also included in the introductory sections.
Finally, a discussion of scoring reliability and of some sources of
rating errors, which can produce scoring unreliability, is provided.

The crucial aspect of the objectification of the scoring
procedure is the definition of the scoring categories. The fluency score
ascribed to an individual's protocol (a set of responses to a specified
task) is the total number of acceptable responses produced. In scoring
a protocol for fluency, the rater must classify each particular response
as either acceptable or unacceptable. As stated in the technical manual
accompanying the Utility Test, an acceptable ideational fluency
response has the defining characteristic of relevance (Wilson,
Merrifield, & Guilford, 1962). However, the manual for the Plot Titles
test, the responses to which can also be scored for ideational fluency,
indicates that any response which is relevant, but not a duplicate of a
previous response, is acceptable (Berger & Guilford, 1969). Since the
manuals accompanying the divergent production tests do not explicitly
define the terms relevant and duplication, an attempt was made in the
developed training materials to define more clearly these two
characteristics of an acceptable response.

For the Utility Test, the examinee must write as many uses as
he can for a brick (Part I) and for a wooden pencil (Part II). A
relevant response in this context must be an example of a possible use
for a brick or for a wooden pencil. The critical word is use. In
general, use indicates putting some object into service for an intended
purpose. The meaning of the word use stresses the practicality of the
object for achieving some desired outcome or result. The training
materials contain a table of some possible categories of uses for a
brick and for a wooden pencil. The suggestion to the raters is to use
the list as a device for familiarizing themselves with some types of
uses which might be encountered in scoring responses, but not to regard
it as a complete listing. In defining the response characteristic of
duplication, four situations are described in the training materials.

1. If the response is an exact replication of a previous response, the
response occurring second on the list would be a duplicate.

2. Another way in which a response can be a duplicate is if the response
is synonymous with a previous response. For example, with reference to
a use of a wooden pencil, "bite it" is essentially synonymous with the
response "chew it." If these two responses occurred on a protocol,
whichever one occurred second on the list would be a duplicate.

3. Another situation in which duplication occurs is when a response is
either a specific or a general case of a previous response. The response
which occurs second is a duplicate of the first response. To illustrate
this type of duplication, the responses "make a list" and "make a grocery
list" can be considered. If the response "make a list" occurred first
and "make a grocery list" occurred second on the protocol, then "make
a grocery list" is a duplicate, since it is a specific case of the
previously given general case "make a list." However, if "make a grocery
list" occurred first and "make a list" occurred second, then "make a
list" is a duplicate, since it is a general case of the previously given
specific case "make a grocery list."

4. The final situation in which duplication occurs is related to the
previous situation of general/specific duplication. In this situation
what is varied in the response series is the type of specific case.
For example, in the series of responses "write a story," "write a poem,"
"write a speech," "write letters," all of the responses are subsumed
under the more general response "to write." However, only the responses
subsequent to the first given would be considered duplicates.

Given the definitions of relevance and duplication, the rater
must then follow specific rules for classifying a given response as
acceptable or unacceptable.

1. If the response is relevant and is not a duplication of a previous
response, the response is categorized as acceptable.

2. If the response is relevant, but is also a duplication of a previous
response, the response is categorized as unacceptable.

3. If the response is irrelevant, it is automatically categorized as
unacceptable.

To illustrate the preceding rules, the process of categorizing the
responses is presented in the training materials as a flow chart, which
is given in Figure 1. In the categorization of a specific response, the
rater must consider two questions. First, does the response provide
a relevant example for the task requested? Secondly, if the response
is relevant, is it a duplicate of a previous response? When the rater
has answered these two questions, utilizing the definitions of relevance
and duplication, the response has been categorized as either acceptable
or unacceptable.
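
The two-question procedure can be sketched in a few lines of Python. This is only an illustration of the rules stated above, not part of the training materials: the names `Judgment`, `categorize`, and `fluency_score` are hypothetical, the relevance and duplication judgments are supplied by the rater rather than computed, and the response "pencils are yellow" is an invented example of an irrelevant response.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One response together with the rater's answers to the two questions."""
    response: str
    relevant: bool           # Question 1: is it a possible use of the object?
    duplicate: bool = False  # Question 2: does it repeat an earlier response?


def categorize(judgment):
    """Apply the three classification rules to a single response."""
    if not judgment.relevant:
        return "unacceptable"   # rule 3: irrelevant responses are rejected
    if judgment.duplicate:
        return "unacceptable"   # rule 2: relevant but a duplicate
    return "acceptable"         # rule 1: relevant and not a duplicate


def fluency_score(judgments):
    """Fluency score = total number of acceptable responses on a protocol."""
    return sum(1 for j in judgments if categorize(j) == "acceptable")


# Hypothetical wooden pencil protocol built from the examples in the text.
protocol = [
    Judgment("make a list", relevant=True),
    Judgment("make a grocery list", relevant=True, duplicate=True),
    Judgment("chew it", relevant=True),
    Judgment("bite it", relevant=True, duplicate=True),
    Judgment("pencils are yellow", relevant=False),  # invented irrelevant response
]
score = fluency_score(protocol)
```

The duplication question is asked only for relevant responses, which mirrors the order of the two questions in the text.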

The final section of the training materials incorporates three
suggestions for rater training: provision of examples of acceptable and
unacceptable responses, practice in scoring sample responses, and
discussions of rating discrepancies. This section of the training
materials is structured so that each rater scores the brick responses of
three "individuals" and then scores the wooden pencil responses, but the
scoring process is done in conjunction with the training manual. The
responses of the three "individuals" to both parts of the Utility Test
were developed to provide raters with examples of relevant and irrelevant
responses and the types of duplication outlined previously. When a rater
begins scoring the sample protocols, the instructions in the training
materials indicate that he is to consider the first response and decide
whether the response is acceptable or unacceptable. In order to compare
his decision with a standard, the rater then lifts the slip of paper
following the response. Beneath the slip, the correct categorization
is given. This process is repeated for each response given on the
protocol. With this structure of training materials, all raters can be
exposed to the same information, whereas, if training were conducted
with groups of raters, the specific information may be contingent on the
nature of the group. Finally, a coding sheet was developed to provide
raters with a method for recording their decisions. The coding sheet
was designed to include a system for insuring that the number of tallies
recorded for the acceptable and unacceptable responses sum to the total
number of responses given. A coding sheet of this form should reduce
the number of coding errors on the part of the rater. The use of the
coding sheet is explained in the training manual and practice in its
use is provided during the scoring of the sample protocols.

FIGURE 1

Procedure Used in Scoring Protocols for Fluency

Prior to the evaluation of the preceding training materials,
try-out sessions were conducted. Volunteers were administered the
training materials and, on completion, independently scored a sample
brick protocol and a sample wooden pencil protocol. These protocols
were developed to include examples of the relevance and duplication
characteristics of responses. The results of the try-out sessions
indicated that the transition instructions between the definition of
response characteristics and the scoring of sample protocols required
clarification. No systematic errors in scoring were indicated by an
item by item analysis of rater scoring of the sample protocols. A
final observation in the try-out sessions had implications for the
evaluation procedure. During the try-out sessions, the raters worked in
the same room. Competition between the individuals to finish first or
to keep up with others in the room indicated the importance of raters
working independently and in isolated conditions.
B. Evaluation of the Developed Training Procedure

Raters

In most of the research utilizing some portion of the divergent
production battery, those responsible for scoring protocols have included
the principal researcher and/or members of the staff, graduate or advanced
undergraduate students, or teachers involved in the project (for example,
Cropley, 1967; Clark & Mirels, 1970; Schmadel, Merrifield, & Bonsall,
1965; Fulgosi & Guilford, 1968; Shin, 1971). To summarize, the general
class of raters utilized could be best described as an adult,
well-educated, volunteer group. In the present study, volunteers were
requested from graduate students in the Department of Educational
Research, School of Education, University of Pittsburgh. In the request,
students were informed that raters were needed to score responses to a
creativity test and that the task should take at most two hours to
complete. Information regarding the specific problem and the nature of
the variable being considered was withheld. Of the 21 students asked to
participate, 20 volunteered their services.
Instrumentation and Protocols

The Utility Test purports to measure the structure of intellect
factor of ideational fluency. This test is composed of two parts and
in each part the examinee is required to write as many possible uses as
he can for a specified object. In Part I the object is a brick and in
Part II, a wooden pencil. Five minutes are allotted to each part.

Protocols for the subtests of the divergent production battery,
including the Utility Test, were available from a previous investigation
(Shin, 1971). In June, 1971, tests from the divergent production
battery were administered to 125 eleventh grade students of a suburban
Pittsburgh school district. From this pool of students, 20 individuals
were randomly selected. The responses of these 20 individuals to the
Utility Test were then reproduced, so that four sets of protocols in the
same style of handwriting were available. During the scoring session,
each rater received a set of 40 protocols, a set of responses to Part I
and to Part II of the Utility Test for 20 individuals.
Training Methods

Two training methods were compared: one labeled the developed
training method and the other, the usual training method. With regard
to the usual training method, little information about the rater
training procedures for divergent production tests is available. Thus,
the delineation of the usual training method was derived from an
examination of the Utility Test manual. For the purpose of this study,
the training procedure referred to as the usual training method
consisted of the following procedure. A rater received a package of
training materials. Included in this package were general instructions
for proceeding through the materials, a copy of the scoring directions
provided in the Utility Test manual, and a blank, sample test with a
coding sheet for recording scores. The coding sheet provides spaces for
recording the names of the individual and the score associated with that
individual. The rater was instructed to read the manual and the sample
test carefully and to develop for himself a method for recording the
total number of acceptable responses given by an individual to each part
of the test. No rationale for the test or the scoring procedure was
provided beyond that information included in the materials. Each rater
worked independently and was isolated from other raters. No questions
specific to the scoring procedure were answered.

Raters trained with the developed training method also received
a package of training materials. Included in this package were general
instructions for proceeding through the materials and a program which
was designed specifically to train raters to score the Utility Test,
which was described previously. During the training session, each rater
worked independently and was isolated from other raters. Again, no
questions specific to the scoring procedures were answered.
Procedure

From the pool of 20 volunteers, ten were randomly selected to
be members of the developed training method group. The remaining ten
were trained with the usual method. Within the two training groups,
five raters were randomly selected to score the responses to Part I
first and to Part II second (order 1) and the other five scored
protocols in the order Part II first and Part I second (order 2). Each
rater was permitted to select the time and the location for
participation at his convenience. When a given rater participated, he
was provided with a package containing the appropriate training
materials. Each rater worked independently and was isolated from other
raters who may have also selected to participate at that time. After
completing the training session, each rater received the protocols of
20 individuals to score. The instructions for the scoring session
indicated to the rater in which order he was to score the protocols.
That is, either all responses to Part I were scored first or all
responses to Part II were scored first. No questions regarding the
scoring procedure were answered. The order of the 20 individuals was
randomized for each rater. Members of the usual training method group
averaged approximately one hour in completing both the training and the
scoring sessions, while members of the developed training method group
averaged approximately two hours in completing both sessions. Within a
period of one week, all raters had participated in the study.
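
The assignment scheme described in this section can be illustrated with a short Python sketch. It is not the procedure actually used in 1973; the function name `assign_raters`, the rater labels 1 to 20, and the seed are assumptions made only so that the example is reproducible.

```python
import random
from collections import Counter


def assign_raters(rater_ids, n_individuals=20, seed=None):
    """Randomly split raters into training method x scoring order cells and
    give every rater an independently shuffled order of the protocols."""
    rng = random.Random(seed)
    raters = list(rater_ids)
    rng.shuffle(raters)                      # random selection into groups
    half = len(raters) // 2
    groups = {"developed": raters[:half], "usual": raters[half:]}

    plan = {}
    for method, members in groups.items():
        within_half = len(members) // 2
        # members are already in random order, so the first half of each
        # group is a random subset assigned to order (I, II)
        for position, rater in enumerate(members):
            order = ("I", "II") if position < within_half else ("II", "I")
            protocols = list(range(1, n_individuals + 1))
            rng.shuffle(protocols)           # individuals randomized per rater
            plan[rater] = {"method": method, "order": order,
                           "protocols": protocols}
    return plan


plan = assign_raters(range(1, 21), seed=1973)   # seed is arbitrary
cell_sizes = Counter((p["method"], p["order"]) for p in plan.values())
```

With 20 raters this yields five raters in each of the four method-by-order cells, matching the layout described above.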



ANALYSIS

A. Design



In the present study, six main sources of variation were
investigated: type of training procedure, raters, individuals, test
part, sequence in scoring, and order of scoring each test part. In each
of the two training procedures, one half of the raters (5 R) scored the
responses of 20 individuals (I) to Part I first and to Part II second
(order I, II), while the remaining half of the raters scored Part II
first and Part I second (order II, I). In addition, the scores were
assigned over a sequence factor, scores assigned first (sequence 1)
and scores assigned second (sequence 2). Considering the part, order,
and sequence variables, a 2 X 2 X 2 factorial design with eight design
cells can be generated, as shown in Figure 2. However, when the nature
of each cell is investigated, certain cell combinations do not exist.
Given the order I, II, raters can not possibly score Part II first and
Part I second. Similarly, with the order II, I, the cells corresponding
to raters scoring Part I in sequence 1 and Part II in sequence 2 do
not exist. Those cross-hatched cells in Figure 2 represent those
conditions which exist for the present study. If the rater and individual
dimensions are added to the existing cell combinations indicated above,
the design can be represented as in Figure 3. To add the training method
variable to the design in Figure 3 would duplicate that design so that two
such designs exist, one for the developed training method and one for the
usual training method. While the individual dimension crosses all of the
factor levels, raters are nested within order and training methods, but
are crossed with the part, sequence, and individual variables.

FIGURE 2

Representation of the Present Study

FIGURE 3

Design of the Present Study, Not Including Training Dimension
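
The missing cells of the 2 X 2 X 2 layout can be enumerated directly. The following Python sketch is an illustration only; the names `PARTS`, `ORDERS`, `SEQUENCES`, and `cell_exists` are hypothetical. A cell exists only when the part scored at a given sequence position agrees with the scoring order.

```python
from itertools import product

PARTS = ("I", "II")
ORDERS = (("I", "II"), ("II", "I"))   # order I, II and order II, I
SEQUENCES = (1, 2)                    # scored first, scored second


def cell_exists(part, order, sequence):
    """A part/order/sequence cell exists only if that order places
    the given part at the given sequence position."""
    return order[sequence - 1] == part


all_cells = list(product(PARTS, ORDERS, SEQUENCES))        # 8 design cells
existing = [cell for cell in all_cells if cell_exists(*cell)]
missing = [cell for cell in all_cells if not cell_exists(*cell)]
```

Only four of the eight cells exist, and once any two of part, order, and sequence are known the third is fixed. This is one way to see why some sources of variation in such a design cannot be separated from one another.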
The resulting design is best described as a fractional hierarchical
design and, thus, many of the sources of variation are confounded
(Cox, 1958, pp. 247-268; Kirk, 1968, pp. 385-387). Confounding in the
design means that some, or all, of the sources of variation can not be
separated, logically or mathematically, from other sources of variation
in the design. Table 1 presents the sources of variation, the alias, or
confounded, terms of the design, and the expected mean squares. The
training method is indicated with the letter T; individuals, with I;
raters, which are nested within orders and training methods, with R(OT);
scoring order, with O; test part, with P; and sequence, with S. The
remaining terms are the appropriate interaction terms. Nesting factors
are placed in parentheses. In addition to indicating the sources of
variation, the capital letters have also been used in the specification
of the coefficients for the expected mean squares. For example, the
coefficient of the variance component σ² [subscript illegible] is LPSR,
where LPSR equals the product of the number of order levels (L), the
number of test parts (P), the number of sequence levels (S), and the
number of raters within a given order and training method (R). In
stating the linear model for the data, Kirk (1968, p. 390) suggests a
notation which includes the alias terms. Thus, the model to be analyzed
can be stated as

X = [effect terms with subscripts i, r(ot), o{ps}, p{os}, s{op}, t,
    ir(ot), io{ips}, ip{ios}, ...; remaining terms illegible]        (1)
TABLE 1 (legible rows only)

Source     Alias      Expected mean square
[...]      TOP        σ² + P σ²_{PIR(OT)} + LPR [...] + [...] σ²_{PR(OT)} + [...]
TIO        TIPS       σ² + PS σ²_{IR(OT)} + PSR [...]
TIP        TIOS       σ² + S σ²_{PIR(OT)} + LSR [...]
TIS        TIOP       σ² + P σ²_{PIR(OT)} + [...]
PIR(OT)    SIR(OT)    σ² + S σ²_{PIR(OT)}
Random effects (individuals and raters) are represented by Latin letters,
while fixed effects are indicated by Greek letters. However, any
interaction term containing a random factor is also a random factor.
The terms within the braces are the aliases of the corresponding sources
of variation.

B. Estimation of Scoring Reliability

When the theoretical definition of reliability, the ratio of
the true score variance to the observed score variance, is recalled, the
problem of partitioning the observed score variance into its true score
and error score components becomes evident. Burt (1955) succinctly
pointed out the problem when he remarked,

    We have seen that a reliability coefficient is intended to indicate
    the ratio of the estimated variance of the "true" measurements to
    the actual variance of the observed measurements, i.e., to the
    "total variance" conceived as the sum of the "true variance" and
    an "error variance." But how do we know that the value taken in
    the numerator in the ratio just calculated really represents the
    "true variance" we have in mind, and that it does not incorporate
    something that we might (if we knew its real nature) also
    regard as error? (p. 115)

Estimating reliability by correlational methods does not permit the
investigator to partition the observed score variance, except at a gross
level. The analysis of variance, on the other hand, allows the
possibility for such partitioning of the observed score, or total,
variance. Through the use of experimental design and the analysis of
variance, factors which affect the reliability estimates can be
identified and more precise estimates of reliability can be obtained.
The use of analysis of variance procedures to estimate test reliability,
in general, and ratings, specifically, has been suggested (Hoyt, 1941;
Ebel, 1951). Typically, two sources of variation are identified:
individuals and test items (test reliability) or individuals and test
raters (rating reliability). However, the principles of design can be
extended to more complex designs, where a number of variables can be
investigated (Stanley, 1952).
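
As an illustration of the typical two-source case (individuals and raters), the following Python sketch partitions the total sum of squares for a fully crossed individuals by raters layout. It is a simplified case, not the fractional hierarchical design analyzed in this study, and the score matrix is invented for the example; the function name `partition` is also hypothetical.

```python
def partition(scores):
    """Partition the total sum of squares of an individuals x raters table.

    scores[i][j] is the score rater j assigned to individual i.
    Returns {source: (sum of squares, degrees of freedom)}.
    """
    n = len(scores)          # number of individuals
    k = len(scores[0])       # number of raters
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_individuals = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    # residual: individual-by-rater interaction, used as the error term
    ss_error = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    return {
        "individuals": (ss_individuals, n - 1),
        "raters": (ss_raters, k - 1),
        "error": (ss_error, (n - 1) * (k - 1)),
        "total": (ss_total, n * k - 1),
    }


# Invented fluency scores: 4 individuals (rows) scored by 3 raters (columns).
scores = [[5, 6, 5], [9, 9, 10], [3, 4, 3], [7, 8, 8]]
table = partition(scores)
mean_squares = {src: ss / df for src, (ss, df) in table.items() if src != "total"}
```

The three component sums of squares add up to the total, which is the partitioning that correlational methods do not provide.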

To estimate the scoring reliability for each training procedure,
the sums of squares and mean squares were computed separately for each
group. The model analyzed can be derived from formula (1) by excluding
from the model any source of variation which contains the training
method and eliminating the training method as a nesting variable. The
expected mean squares for this derived model can be obtained from
Table 1 in a similar manner. Table 2 presents the summary table used
in estimating the scoring reliability.

Given the true and error score model for estimating the average
reliability of ratings with the data analyzed in the analysis of variance
procedure, the general formula is given by

$$ r = \frac{MS_s - MS_e}{MS_s + (k-1)\,MS_e} \tag{2} $$

where MS_s is the mean square for subjects, MS_e is the error mean square,
and k is the number of raters. To estimate the average scoring
reliabilities of the two training groups, formula (2) was restated in
terms of the present design. The appropriate error mean square for the
individual mean square (MS_I) is the mean square for the individual by
rater interaction (MS_IR) and, since five raters are nested within the
two orders, (k-1) is equal to eight. Substituting the appropriate values
into formula (2), a scoring reliability of 0.92432 was obtained for
the raters trained with the developed materials, while a scoring
reliability of 0.83755 was obtained with the usual training procedure.
Ebel (1951) showed that formula (2) is equivalent to the average
intercorrelation between all possible pairs of raters. Thus, these
reliability estimates can be thought of as average estimates of rating
reliability for each training group.

TABLE 2

Summary Table for Reliability Estimates

                             DTM(2)                  UTM(3)
Source(1)           df      SS         MS           SS           MS
I                   19    4205.90    221.3632    6751.4475    355.3393
R(O)                 8      90.66     11.3325    2591.2600    323.9075
IR(O)              152     303.34      1.9957    1140.1400      7.5009
O {PS}               1       1.44      1.4400      85.5625     85.5625
P {OS}               1     294.54    294.5400     578.4025    578.4025
S {OP}               1       9.00      9.0000       6.0025      6.0025
IO {IPS}            19      27.66      1.4558     341.9875     17.9993
IP {IOS}            19     730.06     38.4242    1901.3475    100.0709
IS {IOP}            19      42.10      2.2158      75.7475      3.3795
PR(O) {SR(O)}        8      93.76     11.7200      89.3200     11.1650
PIR(O) {SIR(O)}    152     305.44      2.0095     513.6800      3.7947

(1) The aliases of the sources of variation are in braces.
(2) DTM indicates the developed training method.
(3) UTM indicates the usual training method.

Examining the components in the expected mean square, Ebel (1951)
showed that one could either include the "between-raters" variance in
the formula or could exclude the term. The inclusion or exclusion of
the term, however, depends upon how the ratings are to be used.

    Specifically, the "between-raters" variance should be removed
    where the final ratings on which decisions are based consist of
    averages of complete sets of ratings from all observers, or
    ratings which have been equated from rater to rater such as ranks,
    Z-scores, etc. Likewise, if comparisons are never made practically,
    but only experimentally, between ratings of pupils by different
    raters, the "between-raters" variance should be removed. But if
    decisions are made in practice by comparing single "raw" scores
    assigned to different pupils by different raters, or by comparing
    averages which come from different groups of raters, then the
    "between-raters" variance should be included as part of the error
    term. (p. 412)

When the "between-raters" variance is included in the error term, the
following formula (Ebel, 1951) is appropriate for estimating scoring
reliability:

\[
r_{\text{avg(br)}} = \frac{SS_s - \dfrac{SS_r + SS_e}{k - 1}}{SS_s + (SS_r + SS_e)} \qquad (3)
\]

where $SS_s$ is the sum of squares for subjects, $SS_r$ is the sum of squares
for raters, $SS_e$ is the error sum of squares, and $k$ is the number of
raters. Again, substituting the appropriate sums of squares and the
value eight for (k - 1) into formula (3), the scoring reliability estimate
which includes the "between-raters" variance is 0.90364 for the
developed training method group and 0.62674 for the usual training group.
Formula (3) also provides an average estimate of rating reliability,
except that the variability between raters is included in the error
term. Thus, both formulas (2) and (3) are equivalent to the average
intercorrelation between all pairs of raters if the between-raters
variance is either included in the error term for both computations or
excluded in both computations.
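
The contrast between excluding and including the between-raters variance can be sketched in a few lines of Python. This is only an illustration in mean-square form, not the computation used in the study: the function name `ebel_reliabilities`, the pooling of the rater and error sums of squares over their combined degrees of freedom, and the tiny rating matrix are all assumptions made for the example.

```python
def ebel_reliabilities(ratings):
    """Single-rating reliability with between-raters variance excluded
    and included. ratings[i][j] is the score given to subject i by
    rater j; the matrix must be complete (hypothetical helper)."""
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    subj_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way decomposition: subjects, raters, residual error.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_rater = n * sum((m - grand) ** 2 for m in rater_means)
    ss_err = ss_total - ss_subj - ss_rater

    ms_subj = ss_subj / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    # Assumption: when rater variance counts as error, the rater and
    # error sums of squares are pooled over their combined df.
    ms_pooled = (ss_rater + ss_err) / ((k - 1) + (n - 1) * (k - 1))

    excluded = (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)
    included = (ms_subj - ms_pooled) / (ms_subj + (k - 1) * ms_pooled)
    return excluded, included


# Hypothetical data: rater 2 is exactly one point more lenient than rater 1.
excluded, included = ebel_reliabilities([[1, 2], [2, 3], [3, 4]])
print(f"excluded: {excluded:.3f}  included: {included:.3f}")
```

In this made-up case the raters rank the subjects identically, so the estimate is perfect when rater differences are ignored and drops once the constant leniency difference is treated as error.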

However, if one is interested in the reliability of the average
of ratings, Ebel (1951) has shown that this reliability estimate is

\[
r_k = \frac{MS_s - MS_e}{MS_s} \qquad (4)
\]

where $MS_s$ and $MS_e$ are interpreted as in formula (2). Formula (4)
can be obtained through the application of the Spearman-Brown formula
to formula (2). If one were to find the average rating assigned to each
individual and then to obtain a reliability estimate of these averages,
this estimate would be equivalent to that obtained using formula (4).
The reliability estimate of the average rating is 0.99098 for the developed
training method and 0.97889 for the usual training method.
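
The link between the single-rating estimates and the reliability of the average rating can be checked numerically. The sketch below applies the Spearman-Brown formula to the formula (2) estimates reported in the text (0.92432 and 0.83755), taking k = 9 on the assumption that the stated value of eight for (k - 1) applies here as well; the function name is illustrative only.

```python
def spearman_brown(r_single, k):
    """Reliability of the mean of k ratings, given the reliability
    of a single rating (Spearman-Brown prophecy formula)."""
    return k * r_single / (1 + (k - 1) * r_single)


K = 9  # assumption: (k - 1) = 8, as stated for formula (3)

dtm_average = spearman_brown(0.92432, K)  # developed training method
utm_average = spearman_brown(0.83755, K)  # usual training method

print(round(dtm_average, 5), round(utm_average, 5))
```

Under these assumptions the stepped-up values agree with the reported 0.99098 and 0.97889 to within rounding of the inputs.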

When the three reliability estimates of the two rater training
groups are compared, two results become evident. First, the reliability
estimates for the developed training method group are consistently
higher than those obtained for the usual training method group. In
addition, the reliability estimates for the developed training method
group are all greater than 0.90, while in the usual method group only the
estimate for the reliability of the average score is greater than 0.90.
The second factor to consider is a comparison between the two training
groups of the average reliability estimate when the between-rater
variance is excluded and when it is included. In general, when the
between-rater variance is included, the reliability estimate is lower
than when the estimate excludes this variance. This reduction in
reliability is reasonable, since more variability is being included in
the error component. For the present study, when the between-rater
variance is included, the reliability estimate is reduced from 0.83755
to 0.62674 in the usual training method, while the reduction in the
estimate for the developed training method is from 0.92432 to 0.903639.
Typically, the scoring reliability estimates reported are based on the
use of formula (2). This differential reduction in the scoring
reliability estimates serves to indicate the presence of more variability
among the raters who were trained with the usual procedure than among
those trained with the developed procedure.

C. Analysis of Training Design

The results of the training design are presented in Table 3.
As shown in Table 1, no appropriate mean squares for the denominator of
an F ratio are available for some sources of variation. For example,
no other source of variation has the expected mean square of
$\sigma^2_\varepsilon + PSRT\,\sigma^2_{IO} + PS\,\sigma^2_{IR(OT)} + PSI\,\sigma^2_{R(OT)}$, which would provide the appropriate
denominator of an F ratio to test the O source of variation. In order
to test those sources for which no appropriate denominator was available,
quasi-F ratios (Kirk, 1968, pp. 212-214) were formed and the degrees of
freedom for these ratios were computed, as indicated in Table 3. Since
the design contains only one observation per cell, no direct estimate of
the within-cell variability is available. However, the tests for the
I, T, IR(OT), and IT sources of variation required an estimate of this
variability. Therefore, the highest-order interaction, PIR(OT), was
assumed to be zero and the mean square associated with this interaction
was used as the estimate of the within-cell variability. To summarize
the results shown in Table 3, three areas of interest can be identified:
the effects of training, the effects of the rating situation, and the
effects associated with the test part.
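
A quasi-F ratio and its approximate degrees of freedom can be sketched as follows. The example uses the mean squares reported in Table 3 for the T source; the particular combination, (MS for T + MS for PIR(OT)) over (MS for R(OT) + MS for IT), is an assumption chosen because it reproduces the tabled value, and the degrees of freedom use the usual Satterthwaite-type approximation. The function name `quasi_f` is illustrative.

```python
def quasi_f(numerator, denominator):
    """Quasi-F ratio from two lists of (mean_square, df) pairs.

    Returns the ratio and the approximate numerator and denominator
    degrees of freedom (Satterthwaite-type approximation)."""
    ms_num = sum(ms for ms, _ in numerator)
    ms_den = sum(ms for ms, _ in denominator)
    df_num = ms_num ** 2 / sum(ms ** 2 / df for ms, df in numerator)
    df_den = ms_den ** 2 / sum(ms ** 2 / df for ms, df in denominator)
    return ms_num / ms_den, df_num, df_den


# Mean squares and df as reported in Table 3 (assumed combination).
f_t, df1_t, df2_t = quasi_f(
    numerator=[(6549.40125, 1), (2.69447, 304)],    # T, PIR(OT)
    denominator=[(167.62000, 16), (80.70125, 19)],  # R(OT), IT
)
print(round(f_t, 5), round(df1_t), round(df2_t))
```

With these inputs the ratio and rounded degrees of freedom match the entries tabled for T (26.38556 with 1 and 29 degrees of freedom), which lends some support to the assumed combination.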

                                TABLE 3

              ANOVA Summary Table for the Training Design

Source               df          SS           MS           F       df for F       p

I                    19     9424.02375    496.00125     5.83614¹    19, 21²    <0.001
R(OT)                16     2681.92000    167.62000    35.30113     16, 304    <0.001
O{PS}                 1       32.40125     32.40125     0.21004¹     1, 18²     0.652
P{OS}                 1      794.01125    794.01125    49.20734¹     1, 31²    <0.001
S{OP}                 1        0.15125      0.15125     0.20005¹   173, 23²    >0.999
T                     1     6549.40125   6549.40125    26.38556¹     1, 29²    <0.001
IR(OT)              304     1443.48000      4.74829     1.76224    304, 304    <0.001
IO{IPS}              19      178.82375      9.41178     1.98214     19, 304     0.009
IP{IOS}              19     1995.51375    105.02704    38.97874     19, 304    <0.001
IS{IOP}              19       52.87375      2.78283     1.03279     19, 304     0.423
IT                   19     1533.32375     80.70125    29.95069     19, 304    <0.001
PR(OT){SR(OT)}       16      183.08000     11.44250     4.26666     16, 304    <0.001
TO{TPS}               1       54.60125     54.60125     0.33406¹     1, 18²     0.570
TP{TOS}               1       34.03125     34.03125     0.81775¹     1, 30²     0.373
TS{TOP}               1       14.85125     14.85125     0.81457¹     1, 25²     0.375
TIO{TIPS}            19      190.82375     10.04336     2.11515     19, 304     0.005
TIP{TIOS}            19      635.89375     33.46809    12.48783     19, 304    <0.001
TIS{TIOP}            19       64.97375      3.41967     1.26914     19, 304     0.202
PIR(OT){SIR(OT)}    304      819.12000      2.69447

¹ Quasi-F ratio of the general form F' = (MS_1 + MS_2) / (MS_3 + MS_4).
² Degrees of freedom for the numerator and denominator, respectively,
  of the quasi-F ratio.

The mean score assigned is estimated as 7.35 in the experimental
group and 13.0725 in the control group. The significant
difference between the mean scores assigned in the two training groups
is most likely associated with the definition of fluency developed in
the training materials. The definition is more precise and, with the
inclusion of the response characteristic of duplication, more restrictive
than the definition provided in the Utility Test manual. However, given
the discrepancies in the definition of ideational fluency across tests
which purport to measure this factor, the restriction provided in the
developed training materials is appropriate. In addition, the presence
of the significant IT, TIP, and TIO interactions indicates that the
training procedure has a differential effect in conjunction with
different individuals and combinations of test part and scoring order. The
nature of confounding, however, complicates this analysis. Specifically,
the fact that the TIP and TIO sources are aliased with the TIOS and
TIPS sources, respectively, does not allow for a precise interpretation
of the interactions. To analyze these interactive effects would
require further investigations with unconfounded designs.

The results of the present study also provide insights into the
general rating situation. The presence of a significant individual
effect indicates variability among the scores associated with individuals
in the population, and an estimate of this variability, $\hat{\sigma}^2_I$, is 5.1656.
Guilford (1964) has termed this variability in the rating situation
absolute halo, which reflects true variation among individuals being
rated. The presence of the significant rater effect indicates variability
among the raters, and an estimate, $\hat{\sigma}^2_R$, is 2.0359. Rater variability has
been termed leniency error by Guilford (1964) and can be interpreted
as systematic differences between raters in the scores assigned. The
estimates of the average scoring reliability which include rater
variance would indicate that more rater variability is associated with
raters trained with the usual method than those trained with the
developed method. On the basis of this information one can conjecture
that the developed training procedure reduces the between-rater
variability.
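
The two variance estimates quoted above can be recovered from the Table 3 mean squares. The sketch below is a reconstruction, not the author's stated computation: both the combination of mean squares and the divisor of 80 are assumptions, adopted because they reproduce the reported values of 5.1656 and 2.0359.

```python
# Mean squares as reported in Table 3.
MS = {
    "I": 496.00125,
    "R(OT)": 167.62000,
    "IR(OT)": 4.74829,
    "IT": 80.70125,
    "PIR(OT)": 2.69447,
}

# Assumption: both components share the same coefficient, 80.
DIVISOR = 80

# Individuals: remove the terms shared with the IT and IR(OT) mean squares.
var_individuals = (MS["I"] + MS["PIR(OT)"] - MS["IT"] - MS["IR(OT)"]) / DIVISOR

# Raters: remove the term shared with the IR(OT) mean square.
var_raters = (MS["R(OT)"] - MS["IR(OT)"]) / DIVISOR

print(round(var_individuals, 4), round(var_raters, 4))
```

The mean squares differenced for individuals are the same four that reproduce the tabled quasi-F ratio for the I source, so the two computations are at least mutually consistent.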

In addition to the significant individual and rater main effects,
two interactions involving the rater dimension were also significant:
IR(OT) and PR(OT). The IR(OT) interaction indicates the tendency for
raters to rate individuals differentially and has been termed the
relative halo effect (Guilford, 1964). Further investigation of this
relative halo effect would be facilitated by including the level of
rater creativity in the design. In other words, the tendency of
raters to rate individuals differentially may be a function of the
rater's own capacity. Similarly, the PR(OT) interaction
indicates the tendency for raters to score the test parts differentially
and provides evidence for the presence of a contrast rating error
(Guilford, 1964). However, this interaction is confounded with the
SR(OT) interaction, so that the interpretation of the PR(OT) as
reflecting a contrast error is only tentative.

Finally, a significant difference between the scores assigned to
the test parts is evidenced. An estimate of the mean associated with
the brick task is 11.2075 and of the mean associated with the wooden
pencil task, 9.2150. Although the part main effect is confounded with
the order/sequence interaction, the presence of this difference should
necessitate the reconsideration of the task equivalency. Researchers
have assumed the equivalency of the brick and wooden pencil tasks.
However, this equivalency is questionable in light of the significant
P main effect and requires additional investigation. The presence of a
significant IP interaction, which is confounded with the IOS interaction,
would indicate the differential response of individuals to the test
parts. Again, this interpretation is contingent upon the presence or
absence of the IOS interaction.
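
The reported pairs of means can be cross-checked against the sums of squares in Table 3. The sketch below assumes a balanced design with 800 observations in total (the degrees of freedom in Table 3 sum to 799), so that each level of a two-level effect has 400 observations; the helper name is illustrative.

```python
def ss_two_level_effect(mean_a, mean_b, n_per_level):
    """Sum of squares for a two-level main effect in a balanced design,
    computed from the two level means."""
    grand = (mean_a + mean_b) / 2.0
    return n_per_level * ((mean_a - grand) ** 2 + (mean_b - grand) ** 2)


N_PER_LEVEL = 400  # assumption: 800 observations split evenly

# Training groups: experimental 7.35, control 13.0725.
ss_training = ss_two_level_effect(7.35, 13.0725, N_PER_LEVEL)

# Test parts: brick 11.2075, wooden pencil 9.2150.
ss_part = ss_two_level_effect(11.2075, 9.2150, N_PER_LEVEL)

print(round(ss_training, 5), round(ss_part, 5))
```

Under this assumption the results agree with the Table 3 entries for T (6549.40125) and P (794.01125), and both pairs of means average to the same grand mean.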



CONCLUSIONS

In summary, the two major effects of the developed training
procedure were to maintain scoring reliability at a level greater than
0.90 and to reduce the average fluency score assigned. As pointed out
previously, the estimates of the scoring reliability for raters trained
with the developed training method are consistently higher than for
those trained with the usual method. In addition, the scoring reliability
estimates are maintained at a level which is of practical significance
in the further use of the Utility Test. This level of scoring
reliability is maintained even when the between-rater variance is included
in the reliability estimate. Such results are not obtained when the
scoring reliabilities for raters trained with the usual procedure are
estimated. These results strongly suggest the effectiveness of training
raters to score protocols for ideational fluency.

The training procedure developed for the present study consisted
of a number of components: the inclusion of the rationale and theory of
the Utility Test, the definition of the response characteristics of
relevance and duplication, scoring practice, and discussion involving
scoring discrepancies. To indicate which factor in the training
procedure produced the results would require additional investigation.
However, one can consider the generalizability of the training model
utilized in the present study to the other measures of creativity
available, as well as to the other factors of creativity. The present
study has involved only the factor of ideational fluency; the factors
of flexibility, elaboration, and originality and the associated tests
were not considered. To apply the training procedure model would
necessitate a careful and precise delineation of the factors involved.
This definitional process is highly recommended. Unless the raters are
provided with clear guidelines for scoring the factors and with practice,
one would expect rater variability to be greater than when raters are
provided with such information. In other words, the definition of the
scoring categories and practice in using these categories are seen as
an integral part of training.

Many issues in the area of creativity research are still
unresolved. For example, the establishment of creative mental abilities
as a construct distinct from that of intelligence has not been confirmed
(McNemar, 1964). Also, the relationships between intelligence, creative
mental abilities, and academic achievement (Shin, 1971) are as yet unclear.
Given the evidence for scoring unreliability, one can conjecture that
the relationships and research inconsistencies might be made more
definitive if scoring reliability were improved.